Abstract: It is sometimes necessary for the owner of proprietary data to publicize some of it while keeping the rest
private. For example, when releasing census data or corporate financial information, the release must be conducted in a
manner  consistent  with  individual  privacy.  The  process  of  publicly  releasing  formerly  private  data  is  called  downgrading. 
However,  it  may  be  possible  to  infer  unreleased  private  information  from  the  downgraded  public  information — the  so- 
called  inference  problem.  Here,  we  discuss  some  of  the  design  decisions  that  we  have  made,  and  continue  to  make, 
concerning  our  prototype  for  a  high  assurance  system  that  evaluates  downgrading  decisions  based  upon  the  amount  of 
private  information  that  may  be  deduced  through  inference.  Our  software  system,  the  Rational  Downgrader,  is  composed 
of  a  knowledge-based  decision  maker  to  determine  the  rules  that  may  be  inferred,  a  GUARD  to  measure  the  amount  of 
leaked  information,  and  a  parsimonious  downgrader  to  modify  the  initial  downgrading  decisions.  At  present,  we  have 
restricted  the  Rational  Downgrader  to  relational  databases.  Of  course,  the  underlying  theories  apply  to  all  forms  of  data. 
In  this  paper,  we  concentrate  on  design  decisions  made  with  the  aim  of  achieving  high  assurance  with  respect  to  an 
optimality  condition. 

1.  INTRODUCTION 

We  feel  that  since  downgrading  is  necessary,  it  should  be  done  in  a  high  assurance  manner.  Inference  problems  must  be 
analyzed  and  controlled.  We  propose  the  Rational  Downgrader  as  a  high  assurance  device  to  perform  downgrading.  The 
goal  of  the  Rational  Downgrader  is  to  mitigate  the  inference  of  private  information  from  information  that  is  publicly 
available.  The  design  goal  of  the  Rational  Downgrader  is  to  satisfy  an  assurance  policy— the  policy  that  an  unqualified 
user  cannot  infer  private  information.  In  practice,  total  assurance  might  be  an  unobtainable  Holy  Grail  of  perfection. 
However,  we  feel  that  our  tool  can  be  utilized  to  achieve  a  pragmatic  level  of  assurance.  Our  preliminary  work  in  this  area 
is  described  in  [CM98a],  [CM98b],  [MC99a],  and  [MC99b]. 

At  present,  we  focus  our  attention  on  relational  databases  and  classification  rules  (class  labels).  We  are  starting 
experiments  with  the  Rational  Downgrader  on  the  UC  Irvine  machine  learning  repository  [UCI]  and  plan  to  use  our 
prototype  on  other  publicly  available  databases. 

2.  DOWNGRADING 

There exists a relational database DB, and initially all rows of DB are considered private. The user or managing authority
of all the information in DB is called High. Every case in DB is described by its row. A row is specified by its key k. Row
k, written rk, is an (n+1)-tuple whose entries are the attribute values for rk. The last entry is special and is referred to
as the class label. To avoid confusion, we reserve the terms “attribute” and “attribute value” for the first n entries of
any row. It is possible for an attribute value to be missing, which is denoted by placing a ? in that entry. Class labels are
never missing in DB. High determines, based upon reasons of system safety, business decisions, politics, timing, etc.,
which rows are truly sensitive (the private rows) and which rows need no longer be considered private (the public rows).
We call this downgrading the rows. Two new databases, Hdb and Ldb, are formed. Hdb is the same as DB except for the
designation of rows as either private or public. The row keys of Ldb are the same as the row keys of Hdb. If rk is public in
Hdb, then rk in Ldb is identical to rk in Hdb. If rk is private in Hdb, then the n attribute values of rk in Ldb are the same as in
Hdb. However, the class label of rk in Ldb is a missing value. In other words, it is the association of a class label with the
attributes in a private row that is proprietary. The interested reader is referred to [MC99b] for details.
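The construction of Ldb from Hdb can be illustrated with a short sketch. The `Row` type, the `private` flag, and the function `make_ldb` are hypothetical names introduced only for this illustration; they are not taken from the prototype described in [MC99b].

```python
from typing import NamedTuple, Tuple

MISSING = "?"  # marker for a missing entry, as in the text


class Row(NamedTuple):
    key: int                     # row key k
    attributes: Tuple[str, ...]  # the first n entries of the row
    label: str                   # the class label (last entry)
    private: bool                # High's designation in Hdb (assumed flag)


def make_ldb(hdb):
    """Copy Hdb, hiding the class label of every private row."""
    return [row._replace(label=MISSING) if row.private else row for row in hdb]


# Toy Hdb with one public and one private row (values are illustrative).
hdb = [
    Row(1, ("blonde", "yes"), "N", private=False),
    Row(2, ("red", "no"), "S", private=True),
]
ldb = make_ldb(hdb)
```

In this sketch the attribute values of a private row pass through unchanged, and only the association between those attributes and the class label is withheld.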

Downgrading that treats each row as a separate entity, rather than considering rows in conjunction with
each other, will fail to detect many inferences. (Our inference problems are different from the important work done on
microdata disclosure problems using contingency tables and statistical databases, e.g., [DL]. We focus on categorical
relational databases.)

Our assurance concern is the ability of a user of Ldb, called Low, to infer the missing class label associated with
private rows. We restate our assurance problem as

the  ability  of  Low  to  infer  an  associated  private  class  label. 

Note  that  the  assurance  problem  is  not  just  the  class  label,  but  the  fact  that  the  class  label  is  associated  with  a  specific 
private  row.  The  attributes  in  a  private  row  of  Ldb  are  not  of  concern.  It  is  assumed  that  those  attributes  alone  cannot  be 
used  by  Low  to  learn  information  about  the  missing  class  label  in  a  specified  private  row.  Rather,  our  concern  is  that 
knowledge of the public rows can assist Low in learning the missing associated class labels.

3. MODULAR DESIGN

The RATIONAL DOWNGRADER [MC99b] comprises three modular components that may be replaced and
upgraded as assurance, performance, and availability of software change.

The components are the knowledge-based decision maker (DM), the GUARD, and the Parsimonious
Downgrader. After it has performed its initial downgrading, High itself (or some authority supervising High) is the
“operator” of the Rational Downgrader.

Let us study the first component of the Rational Downgrader, DM. The database Ldb can be decomposed into
two databases: L-db and L+db. The database L-db consists of the private rows with the class labels missing, and L+db consists
of the (downgraded) public rows. L+db is fed into DM. DM produces classification rules since our concern, in the prototype
Rational Downgrader, is the missing associated class labels. We use the standard measurements of rule strength from the
field of KDD (e.g., [AIS], [MS]). They are

support = (number of rows in which the rule is correct ÷ total number of rows), and
confidence = (number of rows in which the rule is correct ÷ number of rows where the attribute values agree with the rule
antecedent).
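As a minimal sketch of these two measurements, the function below computes support and confidence for a single rule. The row representation (a dictionary of attribute values paired with a class label) and the name `rule_strength` are assumptions made for illustration, and the four rows are toy data rather than the example database.

```python
def rule_strength(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    rows: list of (attribute dict, class label) pairs.
    antecedent: dict mapping attribute name to the required value.
    """
    # Labels of the rows whose attribute values agree with the antecedent.
    matching = [
        label
        for attrs, label in rows
        if all(attrs.get(name) == value for name, value in antecedent.items())
    ]
    correct = sum(1 for label in matching if label == consequent)
    support = correct / len(rows)
    confidence = correct / len(matching) if matching else 0.0
    return support, confidence


# Toy data (illustrative only).
rows = [
    ({"hair": "blonde", "lotion": "no"}, "S"),
    ({"hair": "blonde", "lotion": "no"}, "N"),
    ({"hair": "red", "lotion": "no"}, "S"),
    ({"hair": "brown", "lotion": "yes"}, "N"),
]
strength = rule_strength(rows, {"hair": "blonde", "lotion": "no"}, "S")
```

Here the rule is correct in one of four rows (support 0.25) and in one of the two rows matching its antecedent (confidence 0.5).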


[Figure: DB is downgraded to L+db, which DM turns into a Rule Set.]


The strength of a rule is the 2-tuple (support, confidence). As a research prototype, we use C4.5 [Q] for DM. C4.5 is a
very popular decision tree algorithm that produces inferential rules. C4.5 uses L+db as training data and views L-db as the
test data. Of course, this can be replaced by other sound knowledge-based inference systems and frameworks or
combinations thereof. A DM such as CBA [HM] or Bayesian methods (e.g., [CM98b]) would also be applicable, and we are
studying their utilization.


4.  INFERENCE 

In [MC99b] we generalized a definition of inference given in [MC99a] that we discuss here. Let A be the categorical
random variable representing the distribution of the missing associated class labels in L-db. Let B be the categorical
random variable representing the distribution of the missing associated class labels in Ldb. For both A and B we are
making a closure assumption that the set of possible outcomes is known, is the same, and is exhausted by the class
labels of Ldb. Thus, A and B describe the same outcomes, but A is based only on the information in the private rows,
whereas B is based on information in all of Ldb.

DEFINITION 1: If A = B, then we have perfect noninference; if A ≠ B, then we have inference.

The  type  and  degree  of  inference  is  what  we  wish  to  measure.  The  confidence  of  the  rules  goes  into  the 
probability  calculations  and  the  support  of  the  rules  gives  a  measurement  of  the  strength  of  the  inferences.  Keep  in  mind 
that  the  method  and  parameters  used  to  generate  the  rules  (e.g.  C4.5  vs.  ID3  vs.  CBA,  etc.)  will  influence  whether  or  not 
we  “have”  inference.  Thus,  future  variants  of  the  Rational  Downgrader  might  use  a  mixture  of  techniques. 

We  want  to  minimize  inference  as  much  as  possible.  We  accomplish  this  minimization  by  changing  certain 
attribute  values  to  missing  values  in  the  public  rows.  The  GUARD,  which  we  discuss  next,  is  the  component  that  would 
allow  a  benign  inference,  since,  according  to  our  assurance  policy,  no  harm  would  occur. 


5.  THE  GUARD 

Rules learned from DM are applied to L-db. As noted, we view L+db as training data and L-db as test data. The information
learned from L-db via the DM rules must now be measured. This is the function of the GUARD.


[Figure: the Rule Set and L-db are fed into the GUARD, which outputs pass or fail.]

The GUARD must determine if the assurance policy is satisfied. It does this based upon the following criteria:

1. Has inference occurred?

2. Is the inference malevolent?

3. Is the malevolent inference associated with rules that have strong support?

These criteria determine whether we have an assurance problem or not. Item 3 is especially subjective (we will not
discuss specific criteria). If High has assurance that private information will not be leaked, our analysis is complete
(pass). However, what if Low is able to glean sensitive information from the public rows, that is, the inference problem (fail)?
High must reconsider the decisions that it made when it initially downgraded the information. To the best of our
knowledge, there is no software tool that accomplishes this. This step, parsimonious downgrading, is an integral part of
the Rational Downgrader.
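One way the three criteria might be combined into a pass/fail decision is sketched below. This is not the GUARD of the prototype: the predicate `is_malevolent` and the numeric threshold `min_support` are assumptions that stand in for the policy-specific and subjective judgments the text leaves open, and the support values in the example are made up.

```python
def guard(inferences, is_malevolent, min_support):
    """Return "fail" if any malevolent inference has strong support, else "pass".

    inferences: iterable of (row_key, inferred_label, rule_support) tuples,
        one per private row for which some rule fired (criterion 1).
    is_malevolent: policy predicate on (row_key, inferred_label) (criterion 2).
    min_support: assumed threshold standing in for "strong support" (criterion 3).
    """
    for row_key, label, support in inferences:
        if is_malevolent(row_key, label) and support >= min_support:
            return "fail"
    return "pass"


# Hypothetical inferences about two private rows.
inferences = [(20, "N", 0.16), (27, "S", 0.32)]


def strict(row_key, label):
    """Policy under which every inference counts as harmful."""
    return True


def lenient(row_key, label):
    """Policy under which every inference counts as benign."""
    return False
```

With the strict policy and a threshold of 0.3 the second inference triggers a failure, while the lenient policy lets both inferences through as benign.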


6.  PARSIMONIOUS  DOWNGRADING 


[Figure: if GUARD = fail, then DB is sent to the Parsimonious Downgrader.]

Parsimonious downgrading was introduced in [CM98b]. Parsimonious downgrading calls the initial downgrading
decisions into question and views these decisions as being too liberal if possible malevolent inferences are not taken into
account. Parsimonious downgrading is the process of adjusting the downgrading process by making less information
publicly available by inserting missing values for some of the attribute values in the public rows. This lessening of the
amount of public information causes the DM to produce weaker inference rules. We call this rule confusion. The net
effect of rule confusion is to lessen Low's ability to infer the private class labels. Of course, hiding additional public
information negatively affects the functionality provided by the public rows to Low, so it must be done judiciously. We
call the database of adjusted (via the missing values) public rows L̄db. It is decomposed (as before) into L̄+db and L̄-db; of
course, L̄-db is just L-db. Future development will also allow the Rational Downgrader to insert missing values into L-db.
We stay away from that approach for now since, for a “small” amount of private rows, deletion of attribute values would
cause serious performance damage. Also, if the rows are filled in a temporal manner and L-db is instantiated after L+db, the
Rational Downgrader could only assure rule confusion ahead of time by manipulating L+db.

We assign a metric to the loss of functionality; i.e., a penalty. A missing value in attribute j of ri is given a score
of w(i,j). The simplest scenario, called the Unity scenario, occurs when w(i,j) is set equal to the constant value 1.
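The penalty can be sketched as a sum of w(i,j) over the cells that are made missing. The function name `penalty`, the cell representation, and the non-uniform weighting shown at the end are hypothetical; only the Unity scenario (w = 1) comes from the text.

```python
def penalty(missing_cells, w=lambda i, j: 1):
    """Total penalty for a collection of (row i, attribute j) cells made missing.

    The default w is the Unity scenario, where every missing value costs 1.
    """
    return sum(w(i, j) for i, j in missing_cells)


cells = [(6, "lotion use"), (7, "lotion use"), (11, "hair color")]
unity_cost = penalty(cells)

# A purely hypothetical weighting in which hair color costs twice as much.
weighted_cost = penalty(cells, w=lambda i, j: 2 if j == "hair color" else 1)
```

Under the Unity scenario the penalty is simply the number of inserted missing values, so a maximum cost translates directly into a maximum count.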

Our goal is to weaken the rule set while minimizing the penalty. What does it mean to weaken the rule set? We
are only interested in rules that affect L-db. We are tacitly assuming that Low can run whatever knowledge discovery
engine High does. Let us elaborate on this.

We assume that High wants to muddle any rules that can be applied to L-db, and Low is aware of this strategy.
Thus, we see that an integral part of parsimonious downgrading is the penalty. With the penalty in mind, we note that one
cannot derive any inference from a brick, but neither is a brick of functional use.

After High has performed parsimonious downgrading within the bounds set by an agreed-upon penalty function, the
process must start again with the Low database being sent into DM. After DM extracts the rules and applies them to L-db
and the inferences have been determined, the GUARD determines the pass/fail decision. If “pass,” then High is done.
High can now have an assurance that Low will not learn things it is not intended to learn. However, if the GUARD
determines “fail,” then the process must be restarted. One must be careful not to get into complex loops of sending the
data through the Rational Downgrader. Trade-offs between assurance and functionality penalties must be evaluated, and
we allow sub-optimality in this respect. (In the present model, human intervention is allowed to accomplish this.)
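The cycle of DM, GUARD, and parsimonious downgrading can be sketched as a bounded loop. The three components are passed in as callables, and the names `run_downgrader` and `max_rounds` are assumptions for illustration. The round limit is one simple way to avoid the "complex loops" mentioned above; it hands the decision back to a human operator rather than iterating indefinitely.

```python
def run_downgrader(public_rows, private_rows, dm, guard, downgrade, max_rounds=5):
    """Repeat DM -> GUARD -> parsimonious downgrading until the GUARD passes.

    Returns (public_rows, rounds) on "pass", or (None, rounds) when the round
    limit is reached and a human operator should decide.
    """
    rounds = 0
    while True:
        rules = dm(public_rows)
        if guard(rules, private_rows) == "pass":
            return public_rows, rounds
        if rounds == max_rounds:
            return None, rounds
        public_rows = downgrade(public_rows, rules)
        rounds += 1


# Stub components (illustrative only): the "rule strength" is the amount of
# public data, the GUARD passes once it is small, and each downgrading round
# removes one item.
def stub_dm(rows):
    return len(rows)


def stub_guard(rules, private_rows):
    return "pass" if rules <= 2 else "fail"


def stub_downgrade(rows, rules):
    return rows[:-1]


result = run_downgrader([1, 2, 3, 4], [], stub_dm, stub_guard, stub_downgrade)
```

With these stubs the loop passes after two rounds of downgrading; lowering `max_rounds` to 1 makes it give up instead.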


7.  PARSIMONIOUS  DOWNGRADING  DESIGN  DECISIONS 
7.1  A  concrete  example. 

Table 1 represents our database. The database has four attributes and one class label, “sunburn” (the row number
identifiers are not considered an attribute). The entire database is designated DB. The first 19 rows are considered public
and they make up L+db. The last 9 rows are considered private. Therefore, the database L-db is the last 9 rows
with missing class labels, and the database Ldb is given by Table 2.

It is our desire to deny Low the ability to infer the missing class labels (with their row associations). Using L+db
as training data, and L-db as test data, we run C4.5 (in the default mode with the -u -f options) and obtain the following
rules (see Figure 2):

RULE 1: “hair = brown” => N

RULE 2: “hair = red” => S

RULE 3: “hair = blonde” ∧ “lotion = no” => S

RULE 4: “hair = blonde” ∧ “lotion = some” => M

RULE 5: “hair = blonde” ∧ “lotion = yes” => N

According to the training data, RULE 1 is correct with confidence 3 out of 3 (written 3/3). RULE 2 is correct 6/6, RULE 3
is correct 3/4, RULE 4 is correct 2/2, and RULE 5 is correct 3/4. Low knows these rules, and if Low applies them to L-db,
Low would obtain every missing class label except for row 24, which, by using C4.5, Low would misclassify as M instead
of S. The ability for Low to learn 8 correct associated class labels is not acceptable. High is keeping very little away from
Low by downgrading rows 20 through 28 with missing class labels.
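To illustrate what Low can do with RULES 1 through 5, the sketch below applies them to the hair color and lotion use values of the private rows 20 through 28 as listed in Table 2. The function `classify` is a hand-written transcription of the five rules, not output produced by C4.5.

```python
# (hair color, lotion use) for the private rows, copied from Table 2.
PRIVATE = {
    20: ("blonde", "yes"),
    21: ("blonde", "some"),
    22: ("blonde", "yes"),
    23: ("blonde", "no"),
    24: ("blonde", "some"),
    25: ("brown", "no"),
    26: ("brown", "some"),
    27: ("red", "no"),
    28: ("red", "some"),
}


def classify(hair, lotion):
    """Apply RULES 1-5 to one row and return the inferred class label."""
    if hair == "brown":          # RULE 1
        return "N"
    if hair == "red":            # RULE 2
        return "S"
    # RULES 3, 4, 5: blonde hair, decided by lotion use
    return {"no": "S", "some": "M", "yes": "N"}[lotion]


predictions = {key: classify(*attrs) for key, attrs in PRIVATE.items()}
```

Every private row is covered by some rule, and the prediction for row 24 is M, which is the single misclassification noted in the text.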

Now High decides to perform parsimonious downgrading and insert missing values for some of the attribute
values in rows 1 through 19. For simplicity, we assume a penalty function of 1. We further assume a maximum cost of 5.
Therefore, only 5 missing values may be inserted (of course, if fewer than 5 yield the same rule confusion, then we
should use fewer). A naive approach would be to go through all possible ways of putting 5 missing values in for attribute
values in L+db. This approach is computationally infeasible since the number of combinatorial possibilities grows
exponentially. Instead, here we make a design decision and attempt an information theoretical approach to inserting
missing values. We call this the rule-based approach, since we use the actual rules generated by C4.5 to decide where to
insert the missing values.

[Table 1: the complete database DB, rows 1 through 28 with all class labels.]

7.2  Rule-based  missing  values. 

C4.5  generates  rules  via  the  Quinlan  [Q]  gain  condition.  This  is  basically  the  normalized  mutual  information  between 
attributes  and  the  class  label.  Of  course,  C4.5  is  rather  sophisticated,  since  rule  pruning  is  performed  via  various  statistical 
techniques,  but  the  gist  is  still  that  of  information  theory.  The  rules  represent  the  strongest  dependencies  between  the 
attributes  and  the  class  labels  in  an  efficient  manner.  We  exploit  these  dependencies  in  deciding  where  High  should  insert 
missing  values.  We  exploit  the  decision  trees  associated  with  the  rules.  The  rule  clauses  are  generated  in  descending  path 
order  down  the  decision  tree.  This  order  represents  the  information  theoretical  interaction  between  the  attributes  and  the 
class  label. 
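The gain criterion can be illustrated by computing the gain ratio (information gain divided by split information) of each attribute at the root of the tree, using the 19 public rows of Table 2. This is a simplified sketch under stated assumptions: it ignores C4.5's pruning, its handling of missing values, and its other refinements, and the function names are introduced here for illustration.

```python
from collections import Counter
from math import log2

ATTRS = ("hair color", "height", "weight", "lotion use")

# Public rows 1-19 of Table 2: (hair color, height, weight, lotion use, sunburn).
PUBLIC = [
    ("blonde", "average", "light", "yes", "N"),
    ("blonde", "average", "heavy", "yes", "N"),
    ("blonde", "short", "average", "yes", "N"),
    ("blonde", "tall", "heavy", "no", "N"),
    ("blonde", "tall", "average", "yes", "M"),
    ("blonde", "short", "heavy", "some", "M"),
    ("blonde", "average", "light", "some", "M"),
    ("blonde", "short", "light", "no", "S"),
    ("blonde", "short", "average", "no", "S"),
    ("blonde", "tall", "light", "no", "S"),
    ("brown", "tall", "heavy", "no", "N"),
    ("brown", "average", "light", "no", "N"),
    ("brown", "short", "average", "some", "N"),
    ("red", "average", "light", "some", "S"),
    ("red", "tall", "heavy", "no", "S"),
    ("red", "average", "light", "no", "S"),
    ("red", "average", "average", "no", "S"),
    ("red", "short", "average", "no", "S"),
    ("red", "average", "light", "some", "S"),
]


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(values).values())


def gain_ratio(rows, attr_index):
    """Information gain of the attribute divided by its split information."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([row[-1] for row in rows]) - remainder
    split_info = entropy([row[attr_index] for row in rows])
    return gain / split_info if split_info else 0.0


ratios = {name: gain_ratio(PUBLIC, i) for i, name in enumerate(ATTRS)}
best = max(ratios, key=ratios.get)
```

On these rows hair color has the largest gain ratio and lotion use the next largest, which is consistent with the rules listed in Section 7.1 testing hair color first and lotion use second.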

Step 1: See which rules are needed to classify L-db. Call these rules the kernel. (In our above example, all of L+db is the
kernel.) From L+db we delete any cases which do not support a rule in the kernel, and from these cases we delete attributes
that are not represented in the rule clauses. In the above example, this leaves us with Table 3. Keep in mind the last
column of Table 3 is still the class label while the first two columns (we do not include the first “column” which is just a
designator for the row, or case, number) are the attribute values. It is from these attribute columns that we change
instantiations into missing values. This is all done to optimize the rule confusion. The GUARD determines the acceptable
level of rule confusion. For now we are assuming that the GUARD wishes to maximize the rule confusion (in general, the
GUARD will perform a trade-off against the functionality).

row      hair color  height   weight   lotion use  sunburn
1        blonde      average  light    yes         N
2        blonde      average  heavy    yes         N
3        blonde      short    average  yes         N
4        blonde      tall     heavy    no          N
5        blonde      tall     average  yes         M
6        blonde      short    heavy    some        M
7        blonde      average  light    some        M
8        blonde      short    light    no          S
9        blonde      short    average  no          S
10       blonde      tall     light    no          S
11       brown       tall     heavy    no          N
12       brown       average  light    no          N
13       brown       short    average  some        N
14       red         average  light    some        S
15       red         tall     heavy    no          S
16       red         average  light    no          S
17       red         average  average  no          S
18       red         short    average  no          S
19       red         average  light    some        S
20 (1)   blonde      tall     heavy    yes         ?
21 (2)   blonde      short    heavy    some        ?
22 (3)   blonde      average  heavy    yes         ?
23 (4)   blonde      short    average  no          ?
24 (5)   blonde      tall     light    some        ?
25 (6)   brown       average  light    no          ?
26 (7)   brown       short    average  some        ?
27 (8)   red         short    average  no          ?
28 (9)   red         short    light    some        ?

Table 2 (numbers in parentheses are the row numbers within L-db)

The agreed number of missing values is now substituted for attribute values in the kernel. C4.5 is run on
this parsimoniously downgraded L+db. For every test (private) class label that is misclassified, we assign the confidence
that the rule has in this misclassification. These values are then added together. (Note that we are not giving any test case
preference over another. We could introduce such a preference by weighting the confidences of the misclassifications.)
This score makes up the confusion value assigned to this assignment of missing values. This is done (which is still a huge
computational effort) for every possible assignment of the n missing values. We collect the set of assignments that results
in the maximal confusion value (if there are no misclassifications, a similar approach is used with respect to the confidence
values of the correct classifications). The GUARD then randomly picks one of these assignments of missing values as the
optimal assignment of missing values.
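The scoring just described can be sketched as follows. The function names, the data layout, and all numbers in the example are hypothetical; in the actual procedure the predicted labels and confidences would come from rerunning C4.5 on each parsimoniously downgraded L+db.

```python
import random


def confusion_value(predictions, true_labels):
    """Sum the rule confidences of all misclassified private rows.

    predictions: row key -> (predicted label, confidence of the rule used).
    true_labels: row key -> actual class label (known to High only).
    """
    return sum(
        confidence
        for key, (label, confidence) in predictions.items()
        if label != true_labels[key]
    )


def pick_optimal(scored_assignments, rng=random):
    """Pick at random among the assignments with the maximal confusion value."""
    top = max(scored_assignments.values())
    best = [a for a, score in scored_assignments.items() if score == top]
    return rng.choice(best)


# Hypothetical outcome for one assignment of missing values.
true_labels = {21: "M", 23: "S", 24: "S"}
predictions = {21: ("S", 0.75), 23: ("S", 0.6), 24: ("N", 0.5)}
score = confusion_value(predictions, true_labels)

# Hypothetical confusion values for three candidate assignments.
scored = {"assignment A": 1.25, "assignment B": 0.5, "assignment C": 1.25}
choice = pick_optimal(scored, rng=random.Random(0))
```

Rows 21 and 24 are misclassified in this toy case, so the confusion value is 0.75 + 0.5 = 1.25, and the random pick is made between the two assignments that share the maximal value.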

Our method is computationally quicker than simply assigning the missing values at random.
Unfortunately, our method is still quite computationally complex due to the factorial nature of assigning the missing values.
The second step is an attempt to continue Quinlan’s analogy with noisy communication channels by determining an
entropy-based approach to assigning missing values. We also will investigate inserting missing values into L-db, instead of
just L+db.

With respect to the example given in Tables 1 & 2, we have not gone through all 29 choose 5 (approximately
120 000) ways of assigning missing values; the 29 counts the attribute values in the kernel of Table 3 (two for each of the
10 blonde rows and one for each of the other 9 rows). For our example, we conclude with some discussion. If we only have 5
missing values to work with, we would not concentrate on rows 8 or 9 of L-db. This is because those rows need RULE 2 to
infer their class label. To confuse RULE 2, the Rational Downgrader would need to insert all 5 missing values for any 5
hair color attribute values in rows 14 through 19 of L+db. However, C4.5 is still able to produce a rule that infers that the
class labels for rows 8 and 9 of L-db are S. An “intelligent” approach would attempt to achieve the biggest bang for the
buck by analyzing how many missing values must be inserted to confuse a rule in conjunction with how many rows of L-db
need this rule for classification purposes. By contrast, we see that inserting 3 missing values for the “lotion use” attribute
in three of rows 1, 2, 3, or 5 of L+db would confuse RULE 5 and would not allow Low to learn the class labels of rows 1 or
3 of L-db.
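The size of the exhaustive search, and a rough version of the "bang for the buck" comparison, can be checked with a few lines of arithmetic. The count of 29 kernel cells follows Table 3 (two attribute values for each of the 10 blonde rows, one for each of the 9 brown and red rows). The `payoff` ratio is an assumed, simplistic measure introduced here for illustration, not a formula from the prototype.

```python
from math import comb

# Attribute values available in the kernel (Table 3).
kernel_cells = 10 * 2 + 9 * 1

# Number of ways to choose which 5 cells become missing values.
assignments = comb(kernel_cells, 5)


def payoff(private_rows_affected, missing_values_needed):
    """Assumed measure: private rows affected per missing value spent."""
    return private_rows_affected / missing_values_needed


# RULE 2: 5 missing values aimed at rows 8 and 9 of L-db (and, per the text,
# C4.5 still recovers their labels).
rule2 = payoff(2, 5)

# RULE 5: 3 missing values protect rows 1 and 3 of L-db.
rule5 = payoff(2, 3)
```

The exact count is 118 755, in line with the figure of approximately 120 000, and under this crude ratio RULE 5 is the more economical target.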

8. PRESENT SOLUTION AND CONCLUSION


This idea of rule confusion must also take into account the amount of confidence C4.5 generates along with the
(hopefully) erroneous rules. In order to bypass an exhaustive search on the attribute values, the Rational Downgrader is
presently implemented by using an informative search. It makes suggestions, based upon a penalty function, as to where to
put a missing value, and selects the one with a minimal penalty. Then, after inserting the missing value, the Rational
Downgrader is run again to determine the next missing value. Globally this may not be the optimal solution, but we must
accept a trade-off against impractical complexity. The proposed penalty function takes into account rule confusion,
rule confidence, and how many private rows are affected by rule confusion. We are attempting to automate this process
and take into account the number of training cases, the number of test cases associated with a leaf, and the classification
error of that leaf. Note that our approach is in the area of data mining. Data miners have successfully used simulated
annealing [EBM] and genetic algorithms [CAF] to expedite their searches. We are attempting to see if such techniques
will work with the Rational Downgrader.

Informative  search  is  a  greedy  search  so,  instead  of  optimality,  we  are  willing  to  accept  a  certain  level  of 
inference  confusion.  We  hope  to  fully  automate  our  process  and  improve  our  search  methods.  At  this  stage  we  feel  that 
further  experimentation  is  in  order  and  we  are  presently  running  additional  experiments. 
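The greedy character of the informative search can be sketched as follows. The function `greedy_missing_values`, the cell representation, and the toy penalty are all hypothetical. In the actual tool, evaluating the penalty for a candidate cell would involve rerunning the decision maker on the modified public rows, which the toy penalty only imitates by depending on what has already been chosen.

```python
def greedy_missing_values(cells, penalty, budget):
    """Choose up to `budget` cells, one at a time, each with minimal penalty.

    cells: candidate (row, attribute) cells in the public rows.
    penalty: function (cells chosen so far, candidate cell) -> number.
    """
    chosen = []
    remaining = sorted(cells)
    for _ in range(budget):
        if not remaining:
            break
        # Re-evaluate every remaining candidate given the current choices.
        best = min(remaining, key=lambda cell: penalty(chosen, cell))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy base penalties (illustrative only).
BASE = {
    (6, "lotion use"): 1,
    (6, "hair color"): 2,
    (7, "lotion use"): 3,
    (11, "hair color"): 4,
}


def toy_penalty(chosen, cell):
    """Base cost, plus 2 if the same row already has a missing value."""
    same_row = any(row == cell[0] for row, _ in chosen)
    return BASE[cell] + (2 if same_row else 0)


picked = greedy_missing_values(BASE, toy_penalty, budget=2)
```

Each step is locally optimal only; as the text notes, the resulting assignment need not be globally optimal, which is the price paid for avoiding the exhaustive search.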


row  hair color  lotion use  sunburn
1    blonde      yes         N
2    blonde      yes         N
3    blonde      yes         N
4    blonde      no          N
5    blonde      yes         M
6    blonde      some        M
7    blonde      some        M
8    blonde      no          S
9    blonde      no          S
10   blonde      no          S
11   brown                   N
12   brown                   N
13   brown                   N
14   red                     S
15   red                     S
16   red                     S
17   red                     S
18   red                     S
19   red                     S

Table 3

As a result of our experimentation, we have come up with an efficient method of parsimoniously downgrading the
database in our example (Table 1). Our experiment utilized the fact that the decision maker C4.5 is based upon valid statistical
and information theoretic principles. Keeping this in mind along with our comments about intelligently inserting missing
values (section 7.2), we found that inserting missing values as shown in Table 4 results in Low misclassifying 5 class
labels (r21, r23, r24, r25, r26). This is an increase of 4 over the original L+db. No other assignment of missing values will cause
more than 5 misclassifications or have more inaccuracy than that generated by using Table 4. However, there are other
assignments of the 5 missing values that result in the same rule confusion as Table 4. We are not looking for uniqueness,
only existence.

Much  still  has  to  be  done;  however,  we  feel  that  our  early  prototyping  efforts  have  shown  a  proof  of  concept  and 
the  Rational  Downgrader  will  become  a  useful  high  assurance  tool  to  assist  in  privacy  and  downgrading  efforts. 

row  hair color  lotion use  sunburn
1    blonde      yes         N
2    blonde      yes         N
3    blonde      yes         N
4    blonde      no          N
5    blonde      yes         M
6    blonde      ?           M
7    blonde      ?           M
8    blonde      no          S
9    blonde      no          S
10   blonde      no          S
11   ?                       N
12   ?                       N
13   ?                       N
14   red                     S
15   red                     S
16   red                     S
17   red                     S
18   red                     S
19   red                     S

Table 4